Infrared Image Super-Resolution Network Utilizing the Enhanced Transformer and U-Net

Infrared images hold significant value in applications such as remote sensing and fire safety. However, infrared detectors often face the problem of high hardware costs, which limits their widespread use. Advancements in deep learning have spurred innovative approaches to image super-resolution (SR), but comparatively few efforts have been dedicated to the exploration of infrared images. To address this, we design the Residual Swin Transformer and Average Pooling Block (RSTAB) and propose the SwinAIR, which can effectively extract and fuse the diverse frequency features in infrared images and achieve superior SR reconstruction performance. By further integrating SwinAIR with U-Net, we propose the SwinAIR-GAN for real infrared image SR reconstruction. SwinAIR-GAN extends the degradation space to better simulate the degradation process of real infrared images. Additionally, it incorporates spectral normalization, dropout, and artifact discrimination loss to reduce the potential image artifacts. Qualitative and quantitative evaluations on various datasets confirm the effectiveness of our proposed method in reconstructing realistic textures and details of infrared images.


Introduction
Infrared imaging utilizes infrared detectors to capture images, differentiating target objects from the background based on the disparities found in radiated heat. This technique effectively addresses the constraints encountered in visible imaging, such as limited object penetration and varying weather conditions, thus playing a pivotal role in diverse fields including climate research [1], disaster warning [2], rescue [3], and resource exploration [4]. However, the resolution of infrared images depends on the quantity and size of the dedicated infrared detector array. The hardware limitation can restrict the imaging quality.
Image super-resolution (SR) can transform low-resolution (LR) images into high-resolution (HR) ones without enhancing optical detector sensitivity or developing complex optical imaging systems, significantly reducing costs. Recent advancements in deep learning have demonstrated exceptional performance in utilizing SR for image enhancement.
Early deep learning-based image SR methods (e.g., SRCNN [5], VDSR [6], and EDSR [7]) rely on convolutional neural networks (CNNs) to learn the mapping from LR images to HR images. However, since convolutional kernels can only perceive local information from the input image, it is necessary to stack deep networks to gradually enlarge the receptive field. Therefore, CNNs may be limited in handling global information in images. Recently, transformer-based methods [8][9][10][11][12] have become widely popular in the field of image SR. They leverage a self-attention mechanism to effectively capture long-range dependencies and handle multi-scale features in images, thus achieving better SR performance. Despite this, the above methods rely only on custom degradation models, such as bicubic interpolation (BI), for paired training, which tends to overlook the complex degradation processes present in real-world images, resulting in overly smooth SR results.
Nowadays, blind-image SR networks (e.g., BSRGAN [13] and Real-ESRGAN [14]) have gained attention for their ability to handle real-world SR tasks. These methods generate training pairs using unknown degradations, which leads to more robust and generalizable models that better meet real-world requirements. However, these works primarily focus on visible images. Few studies have extended image SR to the field of infrared images.
To address the SR challenge in infrared images, we design the Residual Swin Transformer and Average Pooling Block (RSTAB) and construct the novel Infrared Image Super-Resolution model based on Swin Transformer [8] and Average Pooling [15], SwinAIR. SwinAIR combines the strengths of CNN and transformer, allowing for the sufficient extraction of features from infrared images. To achieve SR reconstruction of real infrared images, we further integrate the SwinAIR with the generative adversarial network (GAN) and build the SwinAIR-GAN, which can recover realistic textures and features matching the real infrared images. The primary contributions of this study are as follows:

• We explore a novel infrared image SR reconstruction model, SwinAIR. We introduce the Residual Swin Transformer and Average Pooling Block into the deep feature extraction module of SwinAIR. This configuration effectively extracts both low- and high-frequency infrared features while concurrently fusing them. Comparative results with existing methods demonstrate that our model exhibits superior performance in the SR reconstruction of infrared images.

• We combine SwinAIR with the U-Net [16] to construct SwinAIR-GAN, further exploring the problem of SR reconstruction for real infrared images. SwinAIR-GAN utilizes SwinAIR as the generator network and employs U-Net as the discriminator network. Comparative results with similar methods demonstrate that our model can generate infrared images with better visual effects and more realistic features and details.

• We incorporate spectral normalization, dropout, and artifact discrimination loss to minimize possible artifacts during the restoration process of real infrared images and enhance the generalization ability of SwinAIR-GAN. We also expand the degradation space of the degradation model to emulate the degradation process of real infrared images more accurately. These improvements enable the generation of infrared images that closely align with the textures and details of real-world images while avoiding over-smoothing.

• We establish an infrared data acquisition system that can simultaneously capture LR and HR infrared images of corresponding scenes, addressing the issue of lacking reference images in the real infrared image SR reconstruction task.
The rest of this paper is structured as follows. Section 2 provides a brief overview of related studies, and Section 3 offers a detailed description of the proposed network. In Section 4, we validate the effectiveness of our network by testing it on various datasets and conducting ablation experiments. Finally, the conclusions are presented in Section 5.

Related Works
Infrared image SR methods are derived from visible image SR methods. In this section, we briefly review the image SR reconstruction research status related to our work.

Traditional Infrared Image Super-Resolution Reconstruction Methods
Traditional infrared image SR methods, similar to visible image SR methods, can be categorized into frequency domain-based, dictionary-based, and other methods. Frequency domain-based methods decompose the infrared image into the spatial domain and the frequency domain [17], processing the image separately in each domain. For example, Choi et al. [18] distinguished the edge pixels of infrared images and utilized visible images as guidance to enhance the high-frequency information of infrared images. Mao et al. [19] transformed the infrared image SR problem into a sparse signal reconstruction problem based on differential operations, significantly enhancing the algorithm's speed. Dictionary-based methods [20][21][22][23][24][25] focus on constructing the mapping relationship between LR and HR image patches. By leveraging the relationships between image patches, these methods retain the pattern information of LR images in HR images. This type of method has garnered researchers' attention for its interpretability. Other traditional methods include iterative methods [26,27], regularization methods [28], and so on.
With the advancement of deep learning technology, transformer and GAN technologies have begun to be widely applied in the field of image SR reconstruction, demonstrating significant advantages.

Transformer-Based Image Super-Resolution Reconstruction Methods
Transformer was initially proposed by Vaswani et al. [29] to address sequence modeling in natural language processing tasks. The self-attention mechanism allows the transformer to effectively capture global dependencies. Additionally, its parallel computing capability accelerates training and enhances efficiency. With the development of vision [30,31], detection [32,33], and video-vision [34,35] transformers, superior results have been achieved by transformers in various visual tasks.
In the field of image SR reconstruction, there are mainly two types of transformer-based frameworks. The first type utilizes the transformer structure entirely, such as the end-to-end network proposed by Chen et al. [9]. However, this framework may lack adaptability to biases in some cases. The second type combines transformer with CNN, leveraging the strengths of CNN in capturing local features and transformer in modeling global dependency relationships. This integration enables better capture of both local details and global structures in the image. For example, Liu et al. [8] introduced a backbone Swin Transformer that replaces the standard multi-head self-attention with a shifted-window multi-head self-attention for dense hierarchical feature map prediction, enabling efficient interactions among non-overlapping local windows while addressing the challenges posed by larger-scale images and huge training datasets. Liang et al. [10] built upon this idea and proposed SwinIR, which has demonstrated impressive performance in various tasks, including image SR and dehazing. Subsequently, researchers have continuously produced new research results [11,12] with better performance based on this foundation. Cao et al. [36] took SwinIR as the architecture and proposed the CFMB-T model for infrared remote sensing image SR reconstruction. Yi et al. [37] further utilized a hybrid convolution-transformer network for feature extraction, and the suggested HCTIR deblur model has made great progress in infrared image deblurring.
Despite the promising performance of transformer-based methods in image SR reconstruction, strategies that combine the transformer with CNN for deep feature extraction are still limited. This limitation often results in networks overlooking high-frequency information present in images. The problem is particularly pronounced when dealing with infrared images that involve blurring and a lack of high-frequency details. To address this issue, we propose SwinAIR, which better integrates the transformer with CNN to effectively capture both low- and high-frequency information, thereby enhancing the SR reconstruction capability for infrared images.

GAN was proposed by Goodfellow et al. in 2014 [38]. It consists of a generator network and a discriminator network that are trained adversarially. The generator network progressively creates more realistic samples, while the discriminator network continuously refines its ability to differentiate between real and artificial images. Training concludes when the discriminator network struggles at a given threshold to distinguish between the two, resulting in high-quality generated samples.
Ledig et al. [39] first applied the GAN to image SR reconstruction and introduced the SRGAN. The SRGAN incorporates perceptual loss, which provides more detailed and lifelike SR images. Subsequently, Yan et al. [40] optimized the image quality assessment approach and used it to improve the loss function of SRGAN, further enhancing its performance. Wang et al. [41] introduced the ESRGAN model in 2018, which incorporates a residual dense block into its generator network and employs a relative discriminator network. By enhancing perceptual loss and introducing interpolation, ESRGAN achieves exceptional perceptual reconstruction effects. Based on the ESRGAN, several newer networks have been developed, including BSRGAN [13], RFB-ESRGAN [42], and Real-ESRGAN [14].
Liu et al. [43] introduced the natural image gradient prior. They used visible images from corresponding scenes as guidance to generate infrared images with improved subjective visual quality. Huang et al. [44] proposed HetSRWGAN, which incorporates heterogeneous convolution and adversarial training. The reconstructed infrared images using this method achieve higher objective evaluation metrics. They also introduced the lightweight PSGAN [45] using a multi-stage transfer learning strategy. Liu et al. [46] designed a generator network with recursive attention modules, significantly improving the SR quality of vehicle infrared images. Lee et al. [47] proposed the Style Transfer Super-Resolution Generative Adversarial Network, STSRGAN, enhancing the quality of LR infrared images containing very small targets.
Although the aforementioned GAN-based methods have shown considerable potential, their training processes are often unstable and result in unintended artifacts. To generate more realistic infrared images, we employ the spectral normalization-based U-Net [16] as the discriminator network and add the artifact discrimination loss function in our SwinAIR-GAN.

Proposed Method
In this section, we will comprehensively introduce the SwinAIR and SwinAIR-GAN along with their technical details.

Overall Structure
The structure of SwinAIR is displayed in Figure 1. Infrared images generally have lower resolution and contain more low-frequency information than visible images. Therefore, the input LR infrared image first undergoes shallow feature extraction and is passed directly to the end of the deep feature extraction network. This helps to better preserve its low-frequency information, thereby improving the performance of SwinAIR.
Concurrently, the shallow features are input to the deep feature extraction module to obtain the deep features. We adopt the hierarchical design of the deep feature extraction module from the Swin Transformer [8]. This structure allows for the extraction and integration of features at different scales in the infrared image, helping the network to better learn critical information and reduce noise.
Ultimately, the shallow features and deep features are merged and fed into the upsampling module to generate the SR infrared image.
In the shallow feature extraction module, the LR infrared image $I_{LR}$ is processed by a 3 × 3 convolutional layer to extract the shallow feature map $F_{SF}$. In the deep feature extraction module, several RSTABs (as shown in Figure 1b) and a 3 × 3 convolutional layer are employed to further extract deep features. Residual connections are also used for feature aggregation to obtain the deep feature map $F_{DF}$.
Finally, the feature map obtained from the deep feature extraction module is reconstructed layer by layer according to a specified upsampling factor that follows the order of the upsampling module (as shown in Figure 1a). Simultaneously, the upsampled LR infrared image is moved to the end of the upsampling module, where it is added to the reconstructed feature map $F_{UP}$ to produce the final SR infrared image $I_{SR}$. Here, $H_{UP}(\cdot)$ represents the upsampling module and $I_{SR} \in \mathbb{R}^{H_S \times W_S \times C_S}$, where $H_S$, $W_S$, and $C_S$ indicate the height, width, and channel number of the SR infrared image, respectively (similar to $I_{LR}$). The upsampling reconstruction process is built from the following operations: $H_C(\cdot)$ denotes a 3 × 3 convolutional layer, $H_{Near}(\cdot)$ represents the nearest neighbor interpolation, $H_{Bicubic}(\cdot)$ denotes the bicubic interpolation, and $D(\cdot)$ signifies the dropout operation.
Dropout is a regularization technique originally proposed to address overfitting issues in classification networks. A recent study has demonstrated that dropout can enhance the generalizability of SR networks [48]. Therefore, we apply dropout before the last convolutional layer in the upsampling module in SwinAIR with a probability of 0.5. The effectiveness of this approach is confirmed by ablation experiments in Section 4.4.2.

Deep Feature Extraction Module
Drawing inspiration from previous work [8,49], we develop the deep feature extraction module for SwinAIR. It is mainly composed of several RSTABs, which combine the strengths of transformer and CNN to effectively extract low- and high-frequency information from infrared images.
Specifically, the deep feature extraction module consists of n RSTABs and a 3 × 3 convolutional layer with residual connections. The RSTABs are applied in sequence: the ith RSTAB produces the intermediate feature map $F_i$ from $F_{i-1}$ ($i = 1, \ldots, n$), where $F_0$ denotes the shallow feature map $F_{SF}$, and the final convolutional layer yields the output deep feature map $F_{DF}$. By introducing a convolutional layer at the end of the deep feature extraction module, the inductive bias of convolutional operations is incorporated into the transformer, providing a more robust foundation for the subsequent aggregation of shallow and deep features [10].
As illustrated in Figure 1b, each RSTAB is composed of six Swin Transformer and Average Pooling Layers (STALs) and a 3 × 3 convolutional layer with a residual connection. This allows local features to be transmitted and fused across different layers, enabling the network to deeply learn the mapping relationship between LR and HR infrared images, thereby accelerating the network's learning and convergence speed.
For the given input feature map of the ith RSTAB, $F_{i,0}$, the intermediate output feature map can be further defined as follows:

$$F_{i,j} = H^{STAL}_{i,j}(F_{i,j-1}), \quad j = 1, 2, \ldots, m,$$
$$F_i = H^{C}_{i}(F_{i,m}) + F_{i,0},$$

where $H^{STAL}_{i,j}(\cdot)$ denotes the jth STAL in the ith RSTAB, $F_{i,j}$ represents the feature map output in the jth STAL of the ith RSTAB, and m is the number of STALs in each RSTAB. $H^{C}_{i}(\cdot)$ represents the final 3 × 3 convolutional layer in the ith RSTAB.
The internal structure of STAL is shown in Figure 2. Each STAL has two sections: the High-Frequency Feature Extraction Layer (HFEL), which consists of two parallel average pooling and fully connected layers, and the Swin Transformer Layer (STL) [8]. The use of channel separation strategies allows the network to independently process low- and high-frequency information in infrared images, mitigating interference during the learning of different structural features. This approach also reduces computational complexity and accelerates the convergence of the network.
Specifically, for the jth STAL of the ith RSTAB, the input feature map is separated along the channel dimension into a high-frequency part $F_h$ and a low-frequency part $F_l$; then, $F_h$ and $F_l$ are independently fed to the HFEL and the STL.
With the sensitivity of the average pooling layers and the ability of the fully connected layers to capture details [49], the HFEL comprises two parallel paths to perform high-frequency information extraction, each containing an average pooling layer and a fully connected layer. $F_h$ is divided into two channel-wise maps, $F_{h1}$ and $F_{h2}$, which are separately processed in the parallel paths to generate their respective outputs $\hat{F}_{h1}$ and $\hat{F}_{h2}$ as follows:

$$\hat{F}_{h1} = \mathrm{FC}(\mathrm{Avgpool}(F_{h1})), \quad \hat{F}_{h2} = \mathrm{FC}(\mathrm{Avgpool}(F_{h2})),$$

where $\mathrm{Avgpool}(\cdot)$ signifies the average pooling layer and $\mathrm{FC}(\cdot)$ symbolizes the fully connected layer.
To decrease the computational burden associated with the standard transformer for image processing, we use STL for low-frequency feature extraction. As shown in Figure 2, STL primarily consists of the self-attention module based on the shift-window mechanism (SW-MSA) and the multi-layer perceptron module (MLP). SW-MSA calculates multi-head self-attention within local regions of the image by shift window. This process captures the correlations between different local regions of the image, thereby extracting local structures and texture information. MLP combines non-linear transformations to better capture the semantic information of the image. For $F_l$, the processing can be described as shown below:

$$F'_l = H_{MSA}(\mathrm{LN}(F_l)) + F_l,$$
$$\hat{F}_l = H_{MLP}(\mathrm{LN}(F'_l)) + F'_l,$$

where $\hat{F}_l$ represents the feature map after processing by the STL, $F'_l$ is the intermediate output of the attention branch, $\mathrm{LN}(\cdot)$ denotes the layer normalization operation, $H_{MSA}$ represents the SW-MSA module, and $H_{MLP}$ represents the MLP module.
Finally, the low-frequency information feature map after processing by the STL, $\hat{F}_l$, and the high-frequency information feature maps after processing by the HFEL, $\hat{F}_{h1}$ and $\hat{F}_{h2}$, are fused along the channel dimension to get $F_{i,j}$, which can be represented as

$$F_{i,j} = \mathrm{FU}(\hat{F}_l, \hat{F}_{h1}, \hat{F}_{h2}),$$

where $\mathrm{FU}(\cdot)$ represents the fusion operation.
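The channel separation used by each STAL can be illustrated with a small bookkeeping sketch. It assumes the 1/3 (high-frequency) and 2/3 (low-frequency) channel weights given in the model settings, an even split of the high-frequency channels between the two HFEL paths, and a hypothetical channel count of 180; the function names and the ordering of channels are illustrative assumptions, not the authors' implementation.

```python
def split_channels(num_channels, high_ratio=1 / 3):
    """Return (c_h1, c_h2, c_l): channel counts for the two HFEL paths and the STL."""
    c_high = round(num_channels * high_ratio)  # channels routed to the HFEL
    c_h1 = c_high // 2                         # first parallel HFEL path
    c_h2 = c_high - c_h1                       # second parallel HFEL path
    c_low = num_channels - c_high              # channels routed to the STL
    return c_h1, c_h2, c_low


def split_feature(channels, high_ratio=1 / 3):
    """Split a list of per-channel maps into (F_h1, F_h2, F_l)."""
    c_h1, c_h2, _ = split_channels(len(channels), high_ratio)
    return (channels[:c_h1],
            channels[c_h1:c_h1 + c_h2],
            channels[c_h1 + c_h2:])


def fuse(f_l_hat, f_h1_hat, f_h2_hat):
    """Fuse the processed branches by concatenation along the channel axis."""
    return f_h1_hat + f_h2_hat + f_l_hat


# Hypothetical example: 180 channels, each represented by its index.
f_h1, f_h2, f_l = split_feature(list(range(180)))
fused = fuse(f_l, f_h1, f_h2)
```

Because each branch only sees a fraction of the channels, the attention branch operates on 2/3 of the feature width, which is consistent with the reduced computational cost described above.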

Loss Function
SwinAIR uses L1 loss as the loss function. The L1 loss calculates the pixel-wise differences between SR and HR images as follows:

$$L_1 = \frac{1}{hwc} \sum_{i=1}^{h} \sum_{j=1}^{w} \sum_{k=1}^{c} \left| I^{SR}_{i,j,k} - I^{HR}_{i,j,k} \right|,$$

where h, w, and c denote the height, width, and channel number of the image, respectively. $I^{SR}_{i,j,k}$ and $I^{HR}_{i,j,k}$ represent the pixel values of the SR and HR images at the spatial location (i, j, k), respectively.
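As a minimal sketch of the pixel-wise L1 loss, assuming it is averaged over all h × w × c entries (the usual convention), the computation can be written in plain Python over nested lists. The function name and data layout are illustrative only.

```python
def l1_loss(sr, hr):
    """Mean absolute difference between two images stored as [h][w][c] nested lists."""
    total, count = 0.0, 0
    for row_sr, row_hr in zip(sr, hr):
        for px_sr, px_hr in zip(row_sr, row_hr):
            for v_sr, v_hr in zip(px_sr, px_hr):
                total += abs(v_sr - v_hr)  # |I_SR - I_HR| at one (i, j, k)
                count += 1
    return total / count


# Tiny 2 x 2 single-channel example.
sr_image = [[[0.0], [1.0]], [[2.0], [3.0]]]
hr_image = [[[0.0], [2.0]], [[2.0], [1.0]]]
loss = l1_loss(sr_image, hr_image)  # (0 + 1 + 0 + 2) / 4
```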

SwinAIR-GAN
To generate infrared images that more closely align with real physical characteristics, we further design SwinAIR-GAN based on SwinAIR. The proposed SwinAIR-GAN structure for real infrared image SR is depicted in Figure 3. We use SwinAIR as the generator network and a spectral normalization-based U-Net as the discriminator network.

Generator Network
The generator network utilizes our proposed SwinAIR, which effectively extracts and reconstructs information from infrared images. Detailed information about SwinAIR can be found in Section 3.1.

Discriminator Network
The discriminator network (as shown in Figure 4) employs a U-Net architecture. However, the complex degradation space and structure can increase training instability.
To mitigate this risk, spectral normalization [50] is used to stabilize training. This approach can also alleviate the introduction of unnecessary artifacts during GAN training, thereby achieving a balance between local image detail enhancement and artifact suppression.
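Spectral normalization rescales a layer's weight matrix by its largest singular value, which is typically estimated by power iteration. The following is a minimal pure-Python sketch of that idea for a small 2-D weight; it is not the discriminator's actual code, and the function names and iteration count are assumptions chosen for illustration (practical implementations usually run a single iteration per training step and reshape convolution kernels to 2-D first).

```python
import math


def spectral_norm(weight, n_iter=50):
    """Estimate the largest singular value of a 2-D weight (list of rows)."""
    rows, cols = len(weight), len(weight[0])
    v = [1.0 / math.sqrt(cols)] * cols  # initial right-vector estimate
    sigma = 0.0
    for _ in range(n_iter):
        # u <- W v, then normalize
        u = [sum(weight[r][c] * v[c] for c in range(cols)) for r in range(rows)]
        norm_u = math.sqrt(sum(x * x for x in u)) or 1.0
        u = [x / norm_u for x in u]
        # v <- W^T u; its norm converges to the largest singular value
        v = [sum(weight[r][c] * u[r] for r in range(rows)) for c in range(cols)]
        sigma = math.sqrt(sum(x * x for x in v))
        v = [x / (sigma or 1.0) for x in v]
    return sigma


def normalize_weight(weight, n_iter=50):
    """Divide the weight by its estimated spectral norm."""
    sigma = spectral_norm(weight, n_iter)
    return [[w / sigma for w in row] for row in weight]


w = [[3.0, 0.0], [0.0, 1.0]]
w_sn = normalize_weight(w)  # largest singular value becomes approximately 1
```

Bounding the spectral norm of each layer limits how sharply the discriminator output can change with its input, which is the mechanism behind the stabilizing effect described above.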
The discriminator network primarily comprises 10 convolutional layers, 8 spectral normalization layers, and 3 bilinear interpolation layers. Each of the first 9 convolutional layers is followed by a Leaky ReLU activation function. Table 1 provides the detailed information about each convolutional layer in the discriminator network.

Loss Function
Appropriate loss functions are crucial to quantifying the differences between SR and HR images and optimizing network training. Conventional networks use a combination of per-pixel, perceptual, and adversarial losses in real-image SR tasks. To enhance image details and eliminate artifacts, the SwinAIR-GAN incorporates additional artifact discrimination loss. Below, we will discuss these different loss functions in more detail.
L1 Loss. Details of the L1 loss can be found in Section 3.1.3.
Perceptual Loss. In contrast to L1 loss, perceptual loss aims to make SR infrared images more visually similar to HR infrared images rather than just minimizing pixel differences. It is computed on feature maps extracted from a pretrained network, where $h_l$, $w_l$, and $c_l$ denote the height, width, and channel numbers of the lth feature map, respectively. $\phi^{l}_{i,j,k}(I_{SR})$ and $\phi^{l}_{i,j,k}(I_{HR})$ represent the pixel values of the SR and HR feature maps extracted from the lth layer of the pretrained network at the spatial location (i, j, k), respectively.
Adversarial Loss. Adversarial loss is primarily used to generate realistic SR images. SwinAIR-GAN uses two adversarial losses, $L_{adversarial\_G}$ and $L_{adversarial\_D}$, for the generator network and discriminator network, respectively. $D_\theta(I_{SR})$ and $D_\theta(I_{HR})$ denote the probabilities with which the discriminator classifies the SR and HR images as real images, respectively.
Artifact Discrimination Loss. Artifact discrimination loss is used to minimize SR image artifacts [51]. It is defined in terms of $I_{SR1}$, the SR image obtained through the generator network, and $M_{refine}$, the refined feature map.
Overall loss function. Taking the aforementioned losses into account, the comprehensive loss function of the SwinAIR-GAN is a weighted combination of the L1, perceptual, adversarial, and artifact discrimination losses, where α, β, and γ are the weights that set the proportionate contributions of the loss terms.
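To make the roles of the generator and discriminator losses concrete, the sketch below uses the common log-likelihood (non-saturating) GAN formulation. This specific form is an assumption for illustration; the paper's exact adversarial loss may differ. Inputs are lists of discriminator probabilities that an image is real, and the function names and clamping constant are hypothetical.

```python
import math

EPS = 1e-12  # clamp to keep log() finite; illustrative choice


def generator_adv_loss(d_sr):
    """-mean(log D(I_SR)): small when the discriminator accepts SR images as real."""
    return -sum(math.log(max(p, EPS)) for p in d_sr) / len(d_sr)


def discriminator_adv_loss(d_hr, d_sr):
    """-mean(log D(I_HR)) - mean(log(1 - D(I_SR)))."""
    real_term = -sum(math.log(max(p, EPS)) for p in d_hr) / len(d_hr)
    fake_term = -sum(math.log(max(1.0 - p, EPS)) for p in d_sr) / len(d_sr)
    return real_term + fake_term


# An undecided discriminator (all outputs 0.5) gives log(2) per term.
g_loss = generator_adv_loss([0.5, 0.5])
d_loss = discriminator_adv_loss([0.5, 0.5], [0.5, 0.5])
```

With a U-Net discriminator, the output is a per-pixel map rather than a single score, so in practice these means would run over all pixels of the map.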

Degradation Model
Classical SR networks commonly employ bicubic downsampling to simulate image degradation. However, this approach has difficulty dealing with the complex and unknown degradations in real infrared images, where issues such as blurring and ghosting are commonplace.
To tackle this problem, we extend the degradation space (as shown in Figure 5) to more effectively simulate the degradation process of real infrared images. Typically, image degradation can be represented by a general model, such as the following:

$$I_{LR} = (I_{HR} \otimes k)\downarrow_{s} + n_{\delta},$$

where k denotes the blur kernel, ⊗ denotes the convolution operation, ↓ denotes the downsampling operation, s is the scaling factor, and $n_\delta$ represents noise (typically additive white Gaussian) with a standard deviation of δ.
To expand the degradation space, we consider both first- and second-order degradations. For the first-order degradation process, general and shuffled degradations are applied. The general degradation refers to the sequential application of blur, downsampling, and noise in sets of two or three. The shuffled degradation involves the random ordering of the degradation steps. For the second-order degradation process, the same two degradation types are considered. The general degradation comprises two first-order degradation processes, while the shuffled degradation involves randomly shuffling blur, downsampling, and noise for each of the two first-order degradation processes.
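The ordering logic of the general and shuffled degradations can be sketched as a small scheduler that only decides which step runs when. The sketch assumes every first-order pass uses all three steps (the two-step variants mentioned above are omitted for brevity), and the step names and function names are hypothetical.

```python
import random

STEPS = ("blur", "downsample", "noise")


def first_order(shuffled, rng):
    """One first-order pass: fixed order for 'general', random order for 'shuffled'."""
    steps = list(STEPS)
    if shuffled:
        rng.shuffle(steps)
    return steps


def degradation_sequence(order, shuffled, rng=None):
    """Concatenate `order` first-order passes (order=2 gives a second-order degradation)."""
    rng = rng or random.Random()
    sequence = []
    for _ in range(order):
        # each pass is shuffled independently of the others
        sequence.extend(first_order(shuffled, rng))
    return sequence


general_2nd = degradation_sequence(order=2, shuffled=False)
shuffled_2nd = degradation_sequence(order=2, shuffled=True, rng=random.Random(0))
```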
In the degradation model, blur, noise, and downsampling are the three key factors affecting image degradation. By expanding the degradation space of these factors, true degradation processes are approximated to improve the applicability and generalizability of the network. Our proposed degradation model considers these key factors that affect the degradation of infrared images as follows:
Blur. In our model, isotropic and anisotropic Gaussian blur kernels are used. A Gaussian blur kernel k(i, j) with a kernel size of 2t + 1 is sampled from a Gaussian distribution and can be represented as follows:

$$k(i, j) = \frac{1}{N} \exp\left(-\frac{1}{2} C^{T} \Sigma^{-1} C\right), \quad C = [i, j]^{T},$$

where Σ denotes the covariance matrix, C represents the spatial coordinates with $(i, j) \in [-t, t]$, and N is a normalization constant. Σ can be further expressed as

$$\Sigma = R \begin{bmatrix} \mu_1^2 & 0 \\ 0 & \mu_2^2 \end{bmatrix} R^{T}, \quad R = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix},$$

where $\mu_1$ and $\mu_2$ represent the standard deviations along the two main axes (their squares are the eigenvalues of the covariance matrix), and θ is the rotation angle. When $\mu_1 = \mu_2$, k(i, j) is an isotropic Gaussian blur kernel. Otherwise, k(i, j) is an anisotropic Gaussian blur kernel.
The Gaussian blur kernel configurations are hyperparameters. Different settings affect the blurring of infrared images during degradation and the complexity of the model. Based on previous works (e.g., BSRGAN [13] and Real-ESRGAN [14]) and the characteristics of infrared images, we experiment with various combinations of settings and ultimately determine the optimal settings.
For the kernel size, we uniformly sample from {7 × 7, 9 × 9, ..., 21 × 21} to obtain varying degrees of blurring. For the isotropic Gaussian kernel, we uniformly sample the standard deviation from [0.1, 2.8]. For the anisotropic Gaussian kernel, we uniformly sample the standard deviation from [0.5, 6] and [0.5, 8] with equal probability, and we uniformly sample the rotation angle from [0, π] to simulate motion blurring at different angles.
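The kernel construction and sampling can be sketched in plain Python. The sketch assumes the covariance is a rotated diagonal matrix with the squared standard deviations on the diagonal, and it interprets the two anisotropic ranges as one range per axis; both points, together with the function names, are assumptions rather than the authors' code.

```python
import math
import random


def gaussian_kernel(t, mu1, mu2, theta):
    """(2t+1) x (2t+1) kernel k(i, j) proportional to exp(-0.5 * C^T Sigma^-1 C)."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Sigma = R diag(mu1^2, mu2^2) R^T for a rotation R by theta
    s11 = cos_t ** 2 * mu1 ** 2 + sin_t ** 2 * mu2 ** 2
    s12 = cos_t * sin_t * (mu1 ** 2 - mu2 ** 2)
    s22 = sin_t ** 2 * mu1 ** 2 + cos_t ** 2 * mu2 ** 2
    det = s11 * s22 - s12 * s12
    inv11, inv12, inv22 = s22 / det, -s12 / det, s11 / det  # Sigma^-1
    kernel = [
        [math.exp(-0.5 * (inv11 * i * i + 2 * inv12 * i * j + inv22 * j * j))
         for j in range(-t, t + 1)]
        for i in range(-t, t + 1)
    ]
    total = sum(map(sum, kernel))  # normalization constant N
    return [[v / total for v in row] for row in kernel]


def sample_kernel(rng, isotropic):
    """Draw a kernel using the sampling ranges quoted in the text."""
    size = rng.choice(range(7, 22, 2))  # 7, 9, ..., 21
    t = size // 2
    if isotropic:
        mu1 = mu2 = rng.uniform(0.1, 2.8)
        theta = 0.0
    else:
        mu1 = rng.uniform(0.5, 6.0)  # assumed: one range per axis
        mu2 = rng.uniform(0.5, 8.0)
        theta = rng.uniform(0.0, math.pi)
    return gaussian_kernel(t, mu1, mu2, theta)


kernel = sample_kernel(random.Random(0), isotropic=False)
```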
By randomly using these Gaussian kernels with different configurations during the degradation, we achieve a variety of complex blurring effects that closely simulate the degradation process of real infrared images, thus effectively optimizing the performance of SwinAIR-GAN.
Noise. The noise in infrared images can usually be classified into two categories: random noise and fixed pattern noise. Random noise includes two types of stationary random white noise: Gaussian noise and Poisson noise. Fixed pattern noise is primarily characterized by image non-uniformity and blind elements. Non-uniformity and scanning-array blind elements contribute to multiplicative noise, whereas staring-array blind elements result in salt-and-pepper noise. Therefore, we consider four types of noise for processing: Gaussian noise, Poisson noise, multiplicative noise, and salt-and-pepper noise. We also include JPEG compression noise to expand the degradation space.
Gaussian noise is random noise with a probability density function (PDF) that follows a Gaussian distribution. The PDF of Gaussian noise is given by

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right),$$

where x represents the grayscale value of the image noise, µ represents the mean value of x, and σ represents the standard deviation of x.
Poisson noise follows a Poisson distribution, and its intensity is proportional to that of the image. The noises at different pixel points are independent of one another. The probability distribution of Poisson noise is given by

$$P(x) = \frac{\lambda^{x} e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \ldots,$$

where λ is the expected value, and λ > 0.
Multiplicative noise is produced by random variations in channel characteristics and is related to the signal through multiplication. This type of noise follows a Rayleigh or gamma distribution. In infrared images, the multiplicative noise caused by the non-uniformity is numerous and mixed together in a unified form. In a focal plane detector array, the response of each detection element to a mixture of various non-uniform factors can be expressed as follows:

$$x_i = a_i I + b_i, \quad i = 1, 2, \ldots, M,$$

where $x_i$ represents the response of the ith detection element, I represents the incident infrared radiation, and M represents the number of detection elements in the array. The non-uniformity of each detection element is influenced by the combined impact of the gain, $a_i$, and offset, $b_i$, of each unit.
Salt-and-pepper noise is caused by changes in signal pulse intensity. It includes two variations: high-intensity salt-like noise represented by white dots and low-intensity pepper-like noise represented by black dots. Typically, they both occur simultaneously. Dieickx et al. [52] suggested that the blind elements of a staring array manifest as salt-and-pepper noise. The PDF of salt-and-pepper noise in infrared images can be represented as follows:

$$p(x) = \begin{cases} P_a, & x = a \\ P_b, & x = b \\ 0, & \text{otherwise} \end{cases}$$

where a and b are the gray levels of pepper and salt pixels, and $P_a$ and $P_b$ represent the density distributions of pepper noise and salt noise, respectively.
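The two fixed-pattern noise types can be illustrated on a flat list of pixel values. The sketch assumes a per-element gain/offset model for non-uniformity and 8-bit extremes (0 and 255) for pepper and salt; the function names and parameter values are hypothetical.

```python
import random


def apply_nonuniformity(radiance, gains, offsets):
    """Per-element response x_i = a_i * I_i + b_i (gain/offset non-uniformity)."""
    return [a * value + b for value, a, b in zip(radiance, gains, offsets)]


def salt_and_pepper(pixels, p_pepper, p_salt, rng, low=0, high=255):
    """Replace each pixel by `low` with prob. p_pepper and by `high` with prob. p_salt."""
    noisy = []
    for value in pixels:
        r = rng.random()  # uniform in [0, 1)
        if r < p_pepper:
            noisy.append(low)    # pepper: dark blind element
        elif r < p_pepper + p_salt:
            noisy.append(high)   # salt: bright blind element
        else:
            noisy.append(value)  # pixel left untouched
    return noisy


response = apply_nonuniformity([10.0, 10.0], gains=[1.0, 2.0], offsets=[0.0, 0.5])
noisy = salt_and_pepper([128] * 1000, p_pepper=0.02, p_salt=0.02, rng=random.Random(0))
```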
JPEG compression is a widely used lossy compression technique for digital images, which reduces storage and transmission requirements by discarding unimportant components. The quality of the compressed image is determined by the quality factor q (q ∈ [0, 100]). A higher q value corresponds to a lower compression ratio and better image quality. When q > 90, the image quality is considered high, and no obvious artifacts are introduced. In our degradation model, the range for the JPEG quality factor is set to [30, 95].
Downsampling. Downsampling is a fundamental operation used to change the image size and generate LR images. There are four interpolation methods for image resizing: nearest neighbor, bilinear, bicubic, and area adjustment. Nearest neighbor interpolation introduces pixel deviation issues [53]; hence, we focus primarily on the other three.
The different resizing methods have their own effects on images. Our degradation model randomly selects from downsampling, upsampling, and maintaining the original size to diversify the degradation space. Assuming that the scaling factor is s, the image size is first changed by a factor a (a ∈ [1/2, s]), resulting in an intermediate-sized image. After degradation, the image is resized by a scaling factor of s/a, so that the final image size matches the downsampling scale required relative to the original image.
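The two-stage resizing arithmetic can be checked with a short sketch. It interprets a as a reduction factor (a < 1 enlarges, a = 1 keeps the size, a > 1 shrinks) and assumes an integer overall scale s; this reading and the function name are assumptions for illustration.

```python
def resize_schedule(height, width, s, a):
    """Return (intermediate_size, final_size) for the two-stage resize.

    Stage 1 divides the size by `a`; stage 2 divides by s / a, so the total is 1 / s.
    """
    if not 0.5 <= a <= s:
        raise ValueError("a must lie in [1/2, s]")
    intermediate = (round(height / a), round(width / a))
    # The final size is derived from s directly to avoid accumulated rounding.
    final = (height // s, width // s)
    return intermediate, final


down = resize_schedule(256, 256, s=4, a=2)    # shrink by 2, then by 2
up = resize_schedule(256, 256, s=4, a=0.5)    # enlarge by 2, then shrink by 8
keep = resize_schedule(256, 256, s=4, a=1)    # keep size, then shrink by 4
```

In all three cases the final size is 64 × 64, which is what a ×4 SR training pair requires for a 256 × 256 HR image.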

Experimental Details and Results
In this section, we provide a detailed introduction to the experimental details and validation results of SwinAIR and SwinAIR-GAN.Both quantitative and qualitative analyses demonstrate the superiority of our proposed method for infrared image SR reconstruction.

Datasets and Metrics
Training sets. Due to the scarcity of high-quality public infrared image datasets available for training, we utilize visible image datasets as the training set based on previous work [45]. Specifically, we use the DV2K [54] (DIV2K [55] + Flickr2K [7]) dataset, which contains 3450 images rich in texture and detail.
The purpose of creating our own dataset is to address the lack of reference images in real infrared image SR tasks, which makes it difficult to thoroughly compare how well different methods recover texture and detail. Specifically, we build an infrared data acquisition system consisting of an LR Iray infrared camera and two HR Chengdu Jinglin infrared cameras. The parameters and models of the cameras are detailed in Table 2. The process of acquiring infrared image data with this system is shown in Figure 6. We place the three cameras parallel to each other, with the LR Iray infrared camera in the middle and the two HR Chengdu Jinglin infrared cameras on the sides. Different degradation methods are applied to infrared images captured by the Iray camera across various scenes to construct LR-HR image pairs (in the custom-degraded SR task, the infrared images captured by the Iray camera are considered HR images relative to the degraded images). Simultaneously, we perform parallax correction on the images captured by the two Chengdu Jinglin infrared cameras and then stitch them together to obtain images with a higher resolution than the original ones captured by the Iray camera. These higher-resolution images are used as reference images in real infrared SR tasks (i.e., direct SR reconstruction of images captured by the Iray camera) to evaluate the effectiveness of different methods in reconstructing details and textures from unknown degraded infrared images.
Evaluation metrics. To quantitatively evaluate the SR performance of our proposed method, we use two full-reference image quality metrics, peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), for SwinAIR. To further evaluate the realism of the infrared images generated by SwinAIR-GAN, we also use two no-reference image quality metrics, the natural image quality evaluator (NIQE) and the perceptual index (PI), in addition to PSNR and SSIM. All evaluation metrics are computed on the luminance channel after converting the reconstructed images to the YCbCr color space. We assess the efficiency of SwinAIR using floating-point operations (FLOPs) and inference speed.
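As a rough illustration of the luminance-channel protocol described above, the following sketch computes PSNR on the Y channel. It assumes 8-bit RGB inputs and the ITU-R BT.601 conversion that is common in SR benchmarks; the function names and the optional border crop are illustrative assumptions, not details given in the text.

```python
import numpy as np


def rgb_to_y(img):
    """Convert an 8-bit RGB image of shape (H, W, 3) to BT.601 luma (range 16..235)."""
    img = img.astype(np.float64)
    return 16.0 + (65.481 * img[..., 0]
                   + 128.553 * img[..., 1]
                   + 24.966 * img[..., 2]) / 255.0


def psnr_y(sr, hr, border=0):
    """PSNR in dB between the luma channels of an SR image and its HR reference.

    `border` (an assumption, default 0) crops that many pixels from each edge
    before comparison, as many SR evaluation scripts do.
    """
    y_sr, y_hr = rgb_to_y(sr), rgb_to_y(hr)
    if border > 0:
        y_sr = y_sr[border:-border, border:-border]
        y_hr = y_hr[border:-border, border:-border]
    mse = np.mean((y_sr - y_hr) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(255.0 ** 2 / mse)
```

For single-channel infrared images stored as three identical channels, the conversion reduces to an affine rescaling of the gray values, so the sketch still applies.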

Model and Training Settings
Model settings. For SwinAIR, the number of RSTABs ($n$) and the number of STALs ($m$) are both set to 6. The window size of the SW-MSA in STL is 8, with the number of attention heads set to 6. The weights for the high-frequency channel ($F_h$) and the low-frequency channel ($F_l$) are set to 1/3 and 2/3, respectively. For SwinAIR-GAN, the degradation model parameters are detailed in Section 3.2.4. The generator network parameters are the same as those of SwinAIR, and the discriminator network settings are provided in Table 1.
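The 1/3 and 2/3 channel weights can be pictured as a split of the feature map along its channel axis. The sketch below is only an illustration: the 180-channel width, the helper name `split_channels`, and the choice to take the first channels as the high-frequency part are all assumptions that the text does not specify.

```python
import numpy as np


def split_channels(feat, high_ratio=1 / 3):
    """Split a (C, H, W) feature map into high- and low-frequency parts.

    Returns (F_h, F_l): the first round(C * high_ratio) channels and the rest.
    Which channels go to which branch is an assumption of this sketch.
    """
    c_high = int(round(feat.shape[0] * high_ratio))
    return feat[:c_high], feat[c_high:]


# Hypothetical 180-channel feature map on a 64 x 64 LR patch.
feat = np.random.default_rng(0).standard_normal((180, 64, 64))
f_h, f_l = split_channels(feat)
print(f_h.shape, f_l.shape)  # 60 channels for F_h, 120 channels for F_l
```

Under these assumptions, $F_h$ would go to the HFEL and $F_l$ to the STL, and the two outputs would be concatenated again along the channel axis.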
Training settings. We use the Adam method to optimize SwinAIR and SwinAIR-GAN, with the three hyperparameters of Adam, $\beta_1$, $\beta_2$, and $\epsilon$, set to 0.9, 0.999, and $10^{-8}$, respectively. The initial learning rate is set to $2 \times 10^{-4}$ and decays using a multi-step strategy. The number of iterations for model updates is set to $6 \times 10^{5}$. All training is implemented in PyTorch and conducted on a server equipped with 8 Nvidia A100 GPUs with a batch size of 32. The input LR image size is set to 64 × 64. During the training process, the dataset is augmented using seven data augmentation techniques, including random horizontal flipping, random vertical flipping, and random 90° rotations.
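The three augmentations named above can be sketched as follows. This is a minimal illustration that applies the same random transform to an LR patch and its HR counterpart; the function name, the 0.5 flip probabilities, and the uniform choice of rotation are assumptions, and the remaining augmentation techniques are not specified in the text and are therefore not shown.

```python
import numpy as np


def augment_pair(lr, hr, rng):
    """Apply one random flip/rotation to an LR patch and the matching HR patch.

    Both arrays have shape (H, W) or (H, W, C); the same transform is applied
    to each so that the pair stays spatially aligned.
    """
    if rng.random() < 0.5:            # random horizontal flip
        lr, hr = lr[:, ::-1], hr[:, ::-1]
    if rng.random() < 0.5:            # random vertical flip
        lr, hr = lr[::-1, :], hr[::-1, :]
    k = int(rng.integers(0, 4))       # number of 90-degree rotations (0 to 3)
    lr, hr = np.rot90(lr, k), np.rot90(hr, k)
    return np.ascontiguousarray(lr), np.ascontiguousarray(hr)


rng = np.random.default_rng(0)
lr_patch = rng.random((64, 64))       # 64 x 64 LR input, as in the text
hr_patch = rng.random((256, 256))     # matching x4 HR patch
lr_aug, hr_aug = augment_pair(lr_patch, hr_patch, rng)
```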

SwinAIR

Comparisons with the State-of-the-Art Methods
We compare our SwinAIR method with existing mainstream classical image SR methods on multiple datasets to evaluate their performance in reconstructing infrared images. All LR images are degraded using bicubic downsampling.
Quantitative Evaluations. We evaluate the SR reconstruction performance of SwinAIR and other methods on infrared images across six datasets at different scales. The quantitative comparison results are shown in Table 3. Our network achieves superior reconstruction performance with fewer parameters. At the ×2 and ×3 scales, our method achieves the best metrics across all datasets except for the CVC-14 dataset [56], where the results are second best. At the ×4 scale, our method achieves the best performance across all test sets while reducing the parameters by 19% compared to SwinIR. We also compare the efficiency of SwinAIR with several well-performing methods at the ×4 scale, and the results are shown in Table 4. The FLOPs and inference speed measurements are performed on an NVIDIA A100 GPU with the input image size set to 64 × 64. To ensure the accuracy of the inference speed measurement, we first perform 50 warm-up inferences, which are followed by 100 inferences to calculate the average inference time. Compared to CNN-based methods (e.g., EDSR [7] and SRFBN [62]), transformer-based methods (e.g., SwinIR [10], HAT [12] and SwinAIR) achieve better performance at the cost of reduced inference speed. Nonetheless, the use of the shift-window mechanism can effectively reduce their computational complexity. Additionally, due to the implementation of channel separation strategies, SwinAIR achieves the lowest computational complexity among the compared methods.
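The warm-up-then-average timing procedure described above can be sketched in a framework-agnostic way. The helper name `average_inference_time` and the optional `sync` hook are assumptions; the text does not describe the actual timing code. On a GPU, a synchronization call such as `torch.cuda.synchronize` would typically be passed as `sync` so that asynchronous kernels are finished before the clock is read.

```python
import time


def average_inference_time(run_once, warmup=50, runs=100, sync=None):
    """Return the mean time per call of `run_once` in milliseconds.

    `warmup` untimed calls are made first, then `runs` timed calls.
    `sync` is an optional zero-argument callable invoked before the timer
    starts and before it stops (e.g. a device synchronization function).
    """
    for _ in range(warmup):
        run_once()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(runs):
        run_once()
    if sync is not None:
        sync()
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / runs


# Usage with a stand-in workload; a real measurement would call the model
# on a 64 x 64 input instead.
ms = average_inference_time(lambda: sum(range(1000)))
print(f"{ms:.4f} ms per inference")
```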
Table 3. Quantitative comparison results achieved by different methods on the CVC-14 [56], Flir [57], Iray-384 [58], Iray-ship [59], Iray-aerial photography [60], and Iray-security [61] datasets. We use PSNR (dB)/SSIM * as the evaluation metrics. Params represents the number of parameters of each method. Bold and underlined numbers indicate the best and second-best results, respectively.
Visual comparison. We compare the visual reconstruction results of different methods at the ×4 scale on different datasets. Compared with other methods, our proposed SwinAIR demonstrates superior visual effects and is able to better restore accurate details and textures. For instance, in the restoration of fence details (Figures 7 and 8), bicubic, SRCNN [5], and EDSR [7] produce blurry, visually less appealing images. HAT [12] generates overly sharpened images with speckled and serrated textures. SwinIR [10] and SwinAIR perform better overall, but SwinIR [10] introduces distortions and twists at intersections of distant fence lines. Only SwinAIR reconstructs such detailed features consistently with the reference images. In the restoration of ground textures (Figure 9), all methods except SwinAIR fail to accurately restore the closely spaced horizontal lines at the bottom. Similar situations can also be observed in the contours of cars and ships (Figures 10 and 11) and the lines of door frames (Figure 12), where SwinAIR shows superior capability in recovering fine and contour features. This demonstrates that handling different frequency features separately can effectively enhance the network's performance.

Ablation Study
To further validate the effectiveness of each component in SwinAIR and the rationality of parameter selection, we conduct extensive ablation experiments. The models are retrained on the DV2K [54] dataset and reevaluated on the CVC-14 dataset [56] at the ×4 scale.
Number of RSTABs. Initially, as the number of RSTABs increases, the network's ability to capture features continuously improves, and its performance steadily improves as well. However, when the number of RSTABs increases further, the performance gains begin to saturate while the parameters continue to increase rapidly. The relation between network performance and parameters with varying numbers of RSTABs is shown in Figure 13a. To balance the performance and parameters of SwinAIR, we ultimately set the number of RSTABs to 6.
Number of STALs in RSTAB. As shown in Figure 13b, the performance of SwinAIR starts to plateau when the number of STALs reaches 4 and saturates when the number reaches 6. There is little difference in performance between having 6 and 8 STALs. Therefore, we ultimately set the number of STALs to 6.
Number of attention heads. Using more attention heads can enhance the network's ability to capture complex features, but it also introduces additional computational overhead. By analyzing the impact of different numbers of attention heads on network performance, as shown in Figure 13c, we find that setting the number of attention heads to 6 achieves a good balance between performance and parameters.
Different weights of the high-frequency and the low-frequency channels. The selection of channel separation weights is related to the distribution of high-frequency and low-frequency information in infrared images. As shown in Figure 13d, we notice that as the weight of the low-frequency channel increases, the network performance continuously improves. This indicates that there is still a substantial amount of learnable low-frequency information in the images. However, due to the high complexity of the low-frequency feature processing structure (STL), the model's parameters also increase accordingly. When the low-frequency channel weight exceeds 2/3, further increasing the weight provides minimal performance gain. Therefore, we ultimately set the high-frequency channel weight to 1/3 and the low-frequency channel weight to 2/3.
Different datasets have varied devices, shooting environments, and objects, which may affect the optimal selection of channel separation weights. To further verify the reasonableness of the channel separation weight settings, we perform frequency domain analysis on different datasets. Specifically, we conduct the Fourier transform on the images in each dataset and compute the average spectrum to represent the overall distribution of frequency information in the dataset. The comparison results are shown in Figure 14. Although the spectrograms of different datasets vary in detail, the overall range and intensity of frequency information distribution are similar. This demonstrates that the channel separation weights can be applied to different types of infrared images.
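The frequency domain analysis described above can be approximated with a few lines of NumPy. This is a sketch under stated assumptions: all images in a dataset are taken to be single-channel arrays of the same size, and a log-magnitude scaling is used for readability; neither detail is given in the text, and the function name is illustrative.

```python
import numpy as np


def average_log_spectrum(images):
    """Average centered log-magnitude spectrum of same-sized 2D images.

    Each image is Fourier transformed, shifted so that low frequencies sit at
    the center, converted to log(1 + |F|), and the results are averaged.
    """
    acc = None
    for img in images:
        spectrum = np.fft.fftshift(np.fft.fft2(img.astype(np.float64)))
        log_mag = np.log1p(np.abs(spectrum))
        acc = log_mag if acc is None else acc + log_mag
    return acc / len(images)


# Stand-in data; a real analysis would load the infrared images of a dataset.
rng = np.random.default_rng(0)
dataset = [rng.random((128, 128)) for _ in range(4)]
avg = average_log_spectrum(dataset)
print(avg.shape)
```

Comparing such averaged spectra across datasets gives a coarse view of how energy is spread between the center (low frequencies) and the edges (high frequencies).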
Different structures of the HFEL. Pooling layers in CNNs can reduce data dimensionality and extract high-frequency information from images. We validate the impact of using different pooling layers and different numbers of branches within the HFEL on SwinAIR's performance. The results are shown in Table 5. The results indicate that the average pooling layer outperforms the max pooling layer. Additionally, the structure with two parallel pooling and fully connected layers significantly outperforms the single-branch structure. Interestingly, we find that using both average and max pooling layers together does not perform as well as using either one alone. We hypothesize that this is due to the mismatch and difficulty in fusing features extracted by different pooling layers. Therefore, we ultimately adopt a structure with two parallel average pooling and fully connected layers in the HFEL.
We compare the SwinAIR-GAN method with other GAN-based image SR methods across multiple datasets to thoroughly evaluate its performance in restoring real infrared images. For the sake of fairness, we compare only methods with similar types of loss functions, because the evaluation metrics of GAN-based image SR methods are strongly correlated with their loss functions [64].
Quantitative Evaluations. We evaluate the SR reconstruction performance of SwinAIR-GAN and other GAN-based methods on infrared images across three datasets. We validate the image restoration performance under both bicubic downsampling (BI degradation) and Gaussian blur downsampling (BD degradation) scenarios at the ×4 scale. The quantitative comparison results are shown in Table 6. Additionally, we perform the ×4 scale SR reconstruction directly on the images in the dataset to validate the SR reconstruction performance of different methods on unknown degraded real scenes. We use no-reference evaluation metrics, NIQE and PI, for quantitative analysis. The quantitative comparison results are shown in Table 7. Apart from achieving second-best results on the ASL-TID dataset [57] for unknown degraded real scenes, SwinAIR-GAN outperforms other models of the same category across all other datasets under various degradation conditions.
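To make the BD degradation concrete, the sketch below blurs an HR image with a Gaussian kernel and then subsamples it by the scale factor. The kernel width (sigma 1.6, radius 3), the reflect padding, and the direct subsampling are hypothetical choices made for illustration; the text does not state the blur parameters or the downsampling operator used.

```python
import numpy as np


def gaussian_kernel1d(sigma, radius):
    """Normalized 1D Gaussian kernel with 2 * radius + 1 taps."""
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def bd_degrade(hr, scale=4, sigma=1.6, radius=3):
    """Gaussian blur followed by subsampling for a 2D image of shape (H, W).

    sigma, radius, and the subsampling scheme are assumptions of this sketch.
    """
    kernel = gaussian_kernel1d(sigma, radius)
    padded = np.pad(hr.astype(np.float64), radius, mode="reflect")
    # Separable blur: filter along rows, then along columns.
    blurred = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="valid"), 1, padded)
    blurred = np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="valid"), 0, blurred)
    return blurred[::scale, ::scale]


hr = np.random.default_rng(0).random((256, 256))
lr = bd_degrade(hr)
print(lr.shape)  # one quarter of the HR size along each axis
```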
Visual comparison. We compare the visual reconstruction results of different methods for real infrared images on multiple datasets at the ×4 scale. Compared with other methods, our proposed SwinAIR-GAN is able to restore details and textures more accurately and achieve better visual effects closer to the real images. For example, in Figure 15, only SwinAIR-GAN restores texture details consistent with the real scene. Other methods either introduce a lot of noise (e.g., ESRGAN [41], RealSR [54]) or generate incorrect details and textures (e.g., Real-ESRGAN [14], BSRGAN [13]). In other scenarios, our method recovers texture details that are more visually accurate and consistent with the LR image (e.g., the restoration of window lines on the left side in Figure 16 and the restoration of the person's arm in Figure 17).

Ablation Study
We conduct further ablation experiments on SwinAIR-GAN to validate the effectiveness of the dropout operation and the artifact discrimination loss. The models are retrained on the DV2K [54] dataset and reevaluated on the self-built dataset at the ×4 scale. We quantitatively assess the experimental results under custom degradation (using PSNR (dB)/SSIM as reference metrics) and unknown degradation (using NIQE/PI as reference metrics).
Effectiveness of the dropout operation. The dropout operation in SwinAIR-GAN leads to improved performance across all evaluation metrics compared with the same model without it. The specific results can be found in Table 8. For real infrared images, the dropout operation results in an NIQE reduction of 0.2964 and a PI reduction of 0.1173. Furthermore, there are significant improvements in PSNR and SSIM under various degradations.
Effectiveness of the artifact discrimination loss. The SwinAIR-GAN incorporating the artifact discrimination loss exhibits lower NIQE/PI scores in the real infrared SR task and higher PSNR (dB)/SSIM scores in the custom-degraded SR task than the model without it. The detailed results are shown in Table 9.
2 Higher PSNR (dB)/SSIM mean better performance.

Conclusions
In this study, we devise a novel method, SwinAIR, for infrared image SR reconstruction. Through our carefully designed RSTAB, SwinAIR is capable of extracting both low- and high-frequency information from infrared images, enabling effective resolution enhancement. Building on this foundation, we further explore the SR reconstruction task for real infrared images by integrating SwinAIR with the U-Net to create SwinAIR-GAN. SwinAIR-GAN expands the degradation space to better simulate the degradation process of real infrared images. In addition, spectral normalization, the dropout operation, and the artifact discrimination loss are added to further enhance its performance. To better evaluate the reconstruction effects of different methods on real infrared images, we build an infrared data acquisition system. This system captures corresponding LR and HR infrared images of the same scene, addressing the lack of reference images in real infrared image SR tasks. Extensive comparative experiments on various datasets demonstrate the effectiveness of SwinAIR and SwinAIR-GAN.
Although our method achieves superior performance and delivers visually striking results, there may still be differences between the images reconstructed by SwinAIR-GAN and real infrared images due to the lack of high-quality infrared datasets [65]. Recent studies have shown that transfer learning can effectively improve network performance when datasets are scarce [45,66,67]. In the future, we will consider employing this technique, as well as expanding the training dataset, to further explore new methods for the real infrared image SR reconstruction task.

Figure 1. Structure of our proposed Infrared Image Super-Resolution model based on Swin Transformer and Average Pooling, SwinAIR. First, the input LR infrared image $I_{LR}$ passes through the shallow feature extraction module $H_{SF}$ to extract the shallow feature map $F_0$; then, $F_0$ is further processed and refined by the deep feature extraction module. We obtain $F_n$ after $n$ RSTABs. After $F_n$ undergoes convolutional operations, it is combined with $F_0$ through a residual connection to obtain the deep feature map $F_{DF}$. Finally, $F_{DF}$ is input into the upsampling module $H_{UP}$ to generate the output SR infrared image $I_{SR}$. The interpolation operation here is bicubic.

Figure 2. Structure of the Swin Transformer and Average Pooling Layer (STAL). There are two ways to combine a CNN with the transformer: serial or parallel. The serial method means that each layer can process either low- or high-frequency information but not both. Therefore, to allow each layer to process both types of information simultaneously, a parallel structure with channel separation is used to integrate the CNN and transformer. The feature map is initially divided into $F_h$ and $F_l$, which are then separately fed into the High-Frequency Feature Extraction Layer (HFEL) and the Swin Transformer Layer (STL). In STL, SW-MSA and MLP represent the self-attention module based on the shift-window mechanism and the multi-layer perceptron module, respectively.

Figure 3. SwinAIR-GAN method. The generator network generates the SR infrared feature map SR1 from the LR infrared image and applies an exponential moving average (EMA) to the parameters to obtain the infrared feature map SR2. Various loss functions are then computed for SR1, SR2, and the HR infrared image. Finally, the corresponding discriminator network judges the authenticity of the generated infrared image.

Figure 4. Structure of the discriminator network. We adopt a U-Net structure based on spectral normalization.

Figure 5. The degradation model. We expand the degradation space and consider first- and second-order degradation processes to emulate the real degradation of infrared images more accurately.

Figure 6. Our infrared data acquisition system. (a) System hardware. (b) Process of dataset construction. For SR tasks with custom degradation, images captured by the Iray camera are used as HR reference images and to calculate metrics. Due to differences in images captured by different cameras, we use no-reference metrics for the quantitative analysis of SR reconstruction quality for real SR tasks. The HR images captured by the Chengdu Jinglin cameras are stitched together to serve only as reference images for real SR tasks, assessing whether the SR reconstructed images are consistent with real textures and details.

Figure 7. Visual comparison results achieved by different methods on the CVC-14 dataset [56] at the ×4 scale. GT represents the original HR image in the red box of the leftmost image. Our proposed method can restore more realistic fence details.

Figure 8. Visual comparison results achieved by different methods on the Iray-384 dataset [58] at the ×4 scale. GT represents the original HR image in the red box of the leftmost image. Our proposed method can restore more realistic fence details.

Figure 9. Visual comparison results achieved by different methods on the Flir dataset [57] at the ×4 scale. GT represents the original HR image in the red box of the leftmost image. Our proposed method can restore closely spaced ground textures.

Figure 10. Visual comparison results achieved by different methods on the Iray-security dataset [61] at the ×4 scale. GT represents the original HR image in the red box of the leftmost image. Our proposed method can restore more realistic vehicle contours.

Figure 11. Visual comparison results achieved by different methods on the Iray-ship dataset [59] at the ×4 scale. GT represents the original HR image in the red box of the leftmost image. Our proposed method can restore more realistic ship contours.

Figure 12. Visual comparison results achieved by different methods on the Iray-aerial photography dataset [60] at the ×4 scale. GT represents the original HR image in the red box of the leftmost image. Our proposed method can restore more realistic lines of door frames.

Figure 13. We evaluate the impact of different module configurations and parameter selections on the performance of SwinAIR using PSNR (dB), SSIM, and the number of parameters (M) as performance metrics. In the figures, params represents the number of parameters, and H and L denote the weights of the high-frequency and low-frequency channels, respectively. (a) The impact of the number of RSTABs on the performance of SwinAIR. (b) The impact of the number of STALs within each RSTAB on the performance of SwinAIR. (c) The impact of the number of attention heads in the STL on the performance of SwinAIR. (d) The impact of different weights of the high-frequency and the low-frequency channels on the performance of SwinAIR.

Figure 14. Frequency domain analysis of different datasets. We perform the Fourier transform and spectrum centralization on the images in the datasets. The central part of the spectrogram represents low-frequency information, while the edges represent high-frequency information. The overall range of the spectrum distribution is similar across different datasets.

* Higher metrics mean better performance.

Figure 15. Visual comparison results achieved by different methods for real infrared images on the self-built dataset at the ×4 scale. GT represents the reference image in the red box of the leftmost image, which is obtained by stitching images from two HR cameras. We perform SR reconstruction on the corresponding real-scene images captured by the LR camera and observe the texture differences with the GT images to evaluate performance.

Figure 16. Visual comparison results achieved by different methods for real infrared images on the Iray-384 dataset [58] at the ×4 scale. LR represents the original image in the red box of the leftmost image.

Figure 17. Visual comparison results achieved by different methods for real infrared images on the ASL-TID dataset [57] at the ×4 scale. LR represents the original image in the red box of the leftmost image.
and $C_L$ indicate the height, width, and channel number of the LR infrared image, respectively. $H_{SF}(\cdot)$ represents the shallow feature extraction module, and $F_{SF}$ represents the shallow feature map. In the deep feature extraction module, multiple stacked RSTABs (as shown in Figure $H_{DF}(\cdot)$ corresponds to the deep feature extraction module, and $F_{DF}$ represents the deep feature map. The deep feature extraction module not only expands the receptive field by stacking convolutional layers but also combines the advantages of both transformer and CNN to effectively extract low- and high-frequency image information. With local and global residual learning, the problem of gradient degradation in deep networks is mitigated, significantly reducing the parameters and overall training time.

Table 1. Detailed information about the convolutional layers in the discriminator network.

Table 2. Detailed parameters of the infrared data acquisition system.

Table 4. Efficiency comparison results achieved by different methods. FLOPs (G) and inference speed (ms) * are used as metrics for evaluating computational complexity and real-time performance, respectively.
* Lower metrics mean better performance.

Table 5. Comparison results of using different types of pooling layers and different numbers of branches within the HFEL.
* Higher metrics mean better performance.

Table 8. Comparison results with and without the dropout operation.

Table 9. Comparison results with and without the artifact discrimination loss.